2  Practical One

2.1 Question 1:

2.1.1 Instructions:

Find all rows in “airquality” that have missing values. Note that the airquality dataset in R is always available (just type airquality in the console to see it).

2.1.2 Answer:

Code
class(airquality)
[1] "data.frame"
Code
airquality_with_missingvalues <<- airquality[
  is.na(airquality$Ozone) | is.na(airquality$Solar.R) | 
    is.na(airquality$Wind) | is.na(airquality$Temp) | 
    is.na(airquality$Month) | is.na(airquality$Day),]

print(airquality_with_missingvalues)
    Ozone Solar.R Wind Temp Month Day
5      NA      NA 14.3   56     5   5
6      28      NA 14.9   66     5   6
10     NA     194  8.6   69     5  10
11      7      NA  6.9   74     5  11
25     NA      66 16.6   57     5  25
26     NA     266 14.9   58     5  26
27     NA      NA  8.0   57     5  27
32     NA     286  8.6   78     6   1
33     NA     287  9.7   74     6   2
34     NA     242 16.1   67     6   3
35     NA     186  9.2   84     6   4
36     NA     220  8.6   85     6   5
37     NA     264 14.3   79     6   6
39     NA     273  6.9   87     6   8
42     NA     259 10.9   93     6  11
43     NA     250  9.2   92     6  12
45     NA     332 13.8   80     6  14
46     NA     322 11.5   79     6  15
52     NA     150  6.3   77     6  21
53     NA      59  1.7   76     6  22
54     NA      91  4.6   76     6  23
55     NA     250  6.3   76     6  24
56     NA     135  8.0   75     6  25
57     NA     127  8.0   78     6  26
58     NA      47 10.3   73     6  27
59     NA      98 11.5   80     6  28
60     NA      31 14.9   77     6  29
61     NA     138  8.0   83     6  30
65     NA     101 10.9   84     7   4
72     NA     139  8.6   82     7  11
75     NA     291 14.9   91     7  14
83     NA     258  9.7   81     7  22
84     NA     295 11.5   82     7  23
96     78      NA  6.9   86     8   4
97     35      NA  7.4   85     8   5
98     66      NA  4.6   87     8   6
102    NA     222  8.6   92     8  10
103    NA     137 11.5   86     8  11
107    NA      64 11.5   79     8  15
115    NA     255 12.6   75     8  23
119    NA     153  5.7   88     8  27
150    NA     145 13.2   77     9  27
Code
write.csv(airquality_with_missingvalues, file = "_raw_data/airquality_with_missingvalues.csv") #saving our manipulated data 

2.2 Question 2:

2.2.1 Instructions:

Find mean, sd, min, max for each of temperature and ozone level.

2.2.2 Answer:

Code
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
ozone_mean
[1] 42.12931
Code
ozone_sd <- sd(airquality$Ozone, na.rm = TRUE)
ozone_sd
[1] 32.98788
Code
ozone_min <- min(airquality$Ozone, na.rm = TRUE)
ozone_min
[1] 1
Code
ozone_max <- max (airquality$Ozone, na.rm = TRUE)
ozone_max
[1] 168
Code
temp_mean <- mean(airquality$Temp)
temp_mean
[1] 77.88235
Code
temp_sd <- sd(airquality$Temp)
temp_sd
[1] 9.46527
Code
temp_min <- min(airquality$Temp)
temp_min
[1] 56
Code
temp_max <- max(airquality$Temp)
temp_max
[1] 97

2.3 Question 3:

2.3.1 Instructions:

For linear regression, parameter estimates can be found as follows. β^=(XTX)−1XTY Here, Y is the response variable, and X is the design matrix. The cars data (an R data set, also always available in R) contains two variables: speed and distance to stop. Fit a simple linear regression model to these data, i.e. find the β estimates, using the equation above, and matrix calcuations in R.

2.3.2 Answer:

Code
y_matrix <- cars$dist

x_matrix <- cbind(1, cars$speed)

linear_reg <- function(x, y) {
  # A function that generates all the linear regression estimates 
  #
  # Args:
  #   x - the "x" input data set
  #   y - the observations 
  #
  # Return:
  #   "Beta Estimates:" ; "Estimated Standard Errors:" ; "T-Statistics:" ; "P-Values:" 
  
  num_observations <- nrow(x)
  
  num_variables <- ncol(x)

  beta_hat <- solve(t(x) %*% x) %*% t(x) %*% y   

  regression_residuals <- y - x %*% beta_hat

  rss <- t(regression_residuals) %*% regression_residuals  

  sample_variance <- as.numeric(rss) / (num_observations - num_variables)  

  standard_errors <- sqrt(diag(sample_variance * solve(t(x) %*% x)))  

  t_stats <- beta_hat / standard_errors  

  p_vals <- 2 * (1 - pt(abs(t_stats), num_observations - num_variables))

  return(list(
    "Beta Estimates:" = beta_hat, 
    "Estimated Standard Errors:" = standard_errors, 
    "T-Statistics:" = t_stats, 
    "P-Values:" = p_vals
  ))
}

linear_reg(x_matrix, y_matrix)
$`Beta Estimates:`
           [,1]
[1,] -17.579095
[2,]   3.932409

$`Estimated Standard Errors:`
[1] 6.7584402 0.4155128

$`T-Statistics:`
          [,1]
[1,] -2.601058
[2,]  9.463990

$`P-Values:`
             [,1]
[1,] 1.231882e-02
[2,] 1.489919e-12

2.4 Question 4:

2.4.1 Instructions:

Check that you get the same β estimates as when fitting the linear regression model using lm() in R.

2.4.2 Answers:

Code
reg_model <- lm(dist~speed, data = cars)

summary(reg_model)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

As we can see from the summary output, the linear regression function created in question 3 gives the same estimates as the “lm()” function.